The minimizer Jaccard estimator is biased and inconsistent

Abstract
Motivation: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences.
Results: We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool.
Availability and implementation: Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce.
Supplementary information: Supplementary data are available at Bioinformatics online.

$\langle C_{a,\mathrm{left}}, C_{a,\mathrm{right}}; C_{b,\mathrm{left}}, C_{b,\mathrm{right}}; s \rangle$, where $s \in \{0, \ldots, w\}$ and $C_{a,\mathrm{left}}, C_{a,\mathrm{right}}, C_{b,\mathrm{left}}, C_{b,\mathrm{right}} \in \{0, 1, 2\}$. We then say that an index pair $(i, j)$ with $i, j \in [0, L-w-1]$ has configuration $\langle C_{a,\mathrm{left}}, C_{a,\mathrm{right}}; C_{b,\mathrm{left}}, C_{b,\mathrm{right}}; s \rangle$ if the windows $\{A_{i+1}, \ldots, A_{i+w}\}$ and $\{B_{j+1}, \ldots, B_{j+w}\}$ share $s$ k-mers (i.e., $s = S(i+1, j+1, w)$) and $1$ if $A_{i+w} \in \{B_{j+1}, \ldots, B_{j+w-1}\}$, $2$ otherwise;
An index pair $(i, j)$ has exactly one configuration, and not all configurations are possible; in particular, configurations where exactly one of $C_{a,\mathrm{left}}$ or $C_{b,\mathrm{left}}$ is zero, or exactly one of $C_{b,\mathrm{right}}$ and $C_{a,\mathrm{right}}$ is zero, are impossible. Figure S1 shows some examples of configurations. We may label configuration elements as sets (e.g. $C_{a,\mathrm{left}} = \{0, 2\}$) to indicate all the configurations that can be formed using values from that set, except for impossible configurations. We use $*$ as shorthand for the set $\{0, 1, 2\}$ of all possible values. For example, $\langle *, 0; *, 0; s \rangle$ refers to the configurations $\langle 0, 0; 0, 0; s \rangle$, $\langle 1, 0; 1, 0; s \rangle$, $\langle 2, 0; 1, 0; s \rangle$, $\langle 1, 0; 2, 0; s \rangle$, $\langle 2, 0; 2, 0; s \rangle$. For a configuration $C$ we use $N(C)$ to denote the number of pairs $(i, j)$ such that the configuration of $(i, j)$ is $C$.
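The set shorthand for configurations can be made concrete with a small enumeration. The following is a minimal sketch, not part of the original analysis: it encodes only the impossibility rule stated above (exactly one zero among the two left entries, or exactly one zero among the two right entries), and the function names `is_possible` and `expand` are hypothetical.

```python
from itertools import product

ALL = (0, 1, 2)  # the set abbreviated by "*" in the text


def is_possible(a_left, a_right, b_left, b_right):
    """Apply the rule from the text: a configuration is impossible when
    exactly one of the two left entries is zero, or exactly one of the
    two right entries is zero."""
    left_ok = (a_left == 0) == (b_left == 0)
    right_ok = (a_right == 0) == (b_right == 0)
    return left_ok and right_ok


def expand(a_left=ALL, a_right=ALL, b_left=ALL, b_right=ALL):
    """List every possible configuration (ignoring s) matching a pattern
    in which each entry is given as a set of allowed values."""
    return [c for c in product(a_left, a_right, b_left, b_right)
            if is_possible(*c)]


# The pattern <*, 0; *, 0; s> from the text.
star_zero = expand(a_right=(0,), b_right=(0,))
for a_left, a_right, b_left, b_right in star_zero:
    print(f"<{a_left}, {a_right}; {b_left}, {b_right}; s>")
```

Under this rule the pattern expands to the same five configurations that the text lists for $\langle *, 0; *, 0; s \rangle$.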

A.2 Proof of Theorem 1
In all the following, we will assume that $L \ge 7(w+1)$.
A.2.1 Approximating the minimizer union and intersection (Lemmas 1 and 3)
In this section, we will prove Lemmas 1 and 3. First, we recapitulate the proof of Fact 1 in our notation:
Fact 1. Let $p \in [0, L-1]$. Position $p$ is a minimizer in $A$ iff there exists a unique $i \in [-1, L-w-1]$ such that $p$ charges index $i$. In other words, $M^A_p = \sum_{i=-1}^{L-w-1} X^A_{i,p}$.
Proof. Figure 2 gives the intuition for the proof. For the only if direction, suppose that $p$ charges index $i$. Then, by definition of charging, $a_p = \min\{a_{i+1}, \ldots, a_{i+w}\}$, and so $p$ is a minimizer. For the if direction, suppose that $p$ is a minimizer in $A$. Consider the leftmost window in which it is a minimizer, i.e. the smallest $i' \in [p-w+1, p]$ such that $a_p = \min\{a_{i'}, \ldots, a_{i'+w-1}\}$. Since $i'$ is smallest, then either $i' = p-w+1$ or $a_{i'-1} < a_p$. This is the definition of $p$ charging index $i'-1$. For uniqueness, consider all the possible windows that $p$ can charge, shown in Figure 2. They are all pairwise incompatible, i.e. there is at least one position that is simultaneously required to be larger than $a_p$ and smaller than $a_p$.
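The statement of Fact 1 can be checked numerically on random hash values. The sketch below is illustrative only: it takes "$p$ charges $i$" to mean that $p$ lies in the window $\{i+1, \ldots, i+w\}$, $a_p$ is the minimum of that window, and either $p = i+w$ or $a_i < a_p$, as used in the proofs in this section. Treating index $-1$ as always satisfying the last condition is an assumption made here for the boundary case, and the function names are hypothetical.

```python
import random


def minimizer_positions(hashes, w):
    """Positions holding the smallest hash in at least one window of w k-mers."""
    return {
        min(range(start, start + w), key=hashes.__getitem__)
        for start in range(len(hashes) - w + 1)
    }


def charged_indices(hashes, w, p):
    """Indices i in [-1, L-w-1] charged by position p (assumed definition)."""
    L = len(hashes)
    charged = []
    for i in range(max(-1, p - w), min(L - w - 1, p - 1) + 1):
        if hashes[p] != min(hashes[i + 1: i + w + 1]):
            continue  # a_p is not the minimum of window {i+1, ..., i+w}
        # Assumption: index -1 has no hash value and is treated as "smaller".
        if p == i + w or i == -1 or hashes[i] < hashes[p]:
            charged.append(i)
    return charged


rng = random.Random(0)
L, w = 200, 7
hashes = rng.sample(range(10 * L), L)  # distinct values, i.e. no hash ties
minimizers = minimizer_positions(hashes, w)
counts = [len(charged_indices(hashes, w, p)) for p in range(L)]
# Fact 1: a minimizer charges exactly one index, a non-minimizer charges none.
print(all(counts[p] == (p in minimizers) for p in range(L)))
```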
The expected value of $M^A_p$ is called the density of the minimizer scheme, and we compute it exactly in the following Fact. We note that similar derivations of the density also appeared in Schleimer et al. [2003] and Roberts et al. [2004], but our proof also accounts for the edge cases.
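The proofs below repeatedly use the bound $\sum_{p=0}^{L-1} E[M^A_p] \le 2L/(w+1)$ on the expected number of minimizers. A quick simulation can illustrate it. This is a rough sketch under the assumption that hash values are independent and uniform on $[0, 1]$; the parameter values and the helper name `count_minimizers` are arbitrary choices made here.

```python
import random


def count_minimizers(hashes, w):
    """Number of distinct positions that are the minimum of some window of size w."""
    return len({
        min(range(start, start + w), key=hashes.__getitem__)
        for start in range(len(hashes) - w + 1)
    })


rng = random.Random(1)
L, w, trials = 1000, 10, 50
mean_count = sum(
    count_minimizers([rng.random() for _ in range(L)], w) for _ in range(trials)
) / trials
bound = 2 * L / (w + 1)
print(f"average number of minimizers: {mean_count:.1f}, bound 2L/(w+1): {bound:.1f}")
```

The average should sit slightly below the bound, with the gap coming from the boundary positions.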
Proof. Let $\ell = \max(-1, p-w)$ and $u = \min(L-w-1, p-1)$. For $i \in [\ell+1, u]$, we have $\Pr[X^A_{i,p}] =$ When $p \in [w, L-w]$, we have When $p \in [L-w+1, L-1]$, we have
We are now ready to prove Lemma 1.
Observe that by definition of charging, $X^A_{i,p} = 0$ when $p \notin [i+1, i+w]$. Therefore, We can ignore some of the boundary terms associated with position $-1$ being charged without much loss in accuracy. Let We claim that $E[\hat{I}_{\mathrm{core}}] \le E[\hat{I}(A, B; w)] \le E[\hat{I}_{\mathrm{core}}] + 2$. The lower bound is immediate. For the upper bound, let us first separate out the terms of $\hat{I}$ with $i = -1$ or $j = -1$: For the second term, observe that, by definition of charging, there is at most one value of $q$ for which $X^B_{-1,q} = 1$. Then, since there are no repeated k-mers in $A$ or $B$, there is at most one value of $p$ for which $A_p = B_q$. Finally, by definition of charging, there is at most one value of $i$ for which $X^A_{i,p} = 1$. Hence the second term is at most one; by a symmetrical argument, the third term is at most one as well. This gives us the desired upper bound.
It now suffices to show that $E[\hat{I}_{\mathrm{core}}] = C(A, B; w)$.
The probability $\Pr[X^B_{j,q} = 1, X^A_{i,p} = 1 \mid a_p = b_q = x]$ will depend on the configuration of the indices $i$ and $j$ and on whether $p = i+w$ or $q = j+w$. Therefore, we rearrange the sums as follows. For a configuration $c$, we say that $(i, j) \to c$ when the indices $i$ and $j$ are in configuration $c$, so that Figure 3 gives some examples to develop the intuition for what the inner term can evaluate to. We consider next each summation Equation (4), Equation (5), and Equation (6) separately.
We start with Equation (5). Note that in this case the value of $q$ is fixed to $j+w$, and so there is at most one value of $p$ in the summation that is not 0 (since $A_p = B_q$). We partition the space of all configurations into four possible cases: (i) $c = \langle *, 0; *, 0; s \rangle$, (ii) $c = \langle \{0, 2\}, *; *, 1; s \rangle$, (iii) $c = \langle 1, *; *, 1; s \rangle$, and (iv) $c = \langle *, *; *, 2; s \rangle$. First note that for any $c$, we have $X^B_{j,j+w} = 1$ if and only if $b_{j+1}, \ldots, b_{j+w-1}$ are each greater than $x$.
In case (i) when $c = \langle *, 0; *, 0; s \rangle$, the only value of $p$ for which the probability in Equation (5) is not zero is $p = i+w$. From the definition of charging, we have $X^A_{i,i+w} = 1$ and $X^B_{j,j+w} = 1$ if and only if $a_{i+1}, \ldots, a_{i+w-1}, b_{j+1}, \ldots, b_{j+w-1}$ are each greater than $x$. The number of distinct k-mers in this sequence is $2w-2-S(i+1, j+1, w-1) = 2w-2-S(i+1, j+1, w)+1 = 2w-1-s$. Therefore, $\Pr[X^A_{i,p} = 1, X^B_{j,j+w} = 1 \mid a_p = b_q = x] = (1-x)^{2w-1-s}$ and recalling that $t_0 = \frac{1}{2w-s}$, $t_1 = \frac{1}{(2w-s)(2w-s+1)}$, and $t_2 = \frac{1}{(2w-s)(2w-s+1)(2w-s+2)}$.
For case (ii) with $c = \langle \{0, 2\}, *; *, 1; s \rangle$, because $C_{b,\mathrm{right}} = 1$, the only value of $p$ for which the probability in Equation (5) is not zero belongs to $[i+1, i+w-1]$. From the definition of charging, we have $X^A_{i,p} = 1$ iff $a_i < x$ and $a_{i+1}, \ldots, a_{i+w}$, with the exception of $a_p$, are all greater than $x$. As mentioned previously, we have that $X^B_{j,j+w} = 1$ iff $b_{j+1}, \ldots, b_{j+w-1}$ are each greater than $x$. Because $C_{a,\mathrm{left}} \ne 1$, we have $A_i \notin \{B_{j+1}, \ldots, B_{j+w-1}\}$.
Therefore, we have one hash value (i.e. $a_i$) that is less than $x$, and $2w-2-(S(i+1, j+1, w)-1)$ distinct hash values that are more than $x$. As a result,
For the next two cases (i.e., cases (iii) and (iv)) we show that the sum is 0. When $c = \langle 1, *; *, 1; s \rangle$, the fact that $C_{b,\mathrm{right}} = 1$ means that $C_{a,\mathrm{right}} \ne 0$, which implies that $p < i+w$ and that, if $X^A_{i,p} = 1$, then $a_i < x$. The fact that $C_{a,\mathrm{left}} = 1$ implies that $A_i \in \{B_{j+1}, \ldots, B_{j+w}\}$. Therefore, one of the values of $\{b_{j+1}, \ldots, b_{j+w}\}$ is less than $x$, which makes it impossible that $X^B_{j,q} = 1$. When $c = \langle *, *; *, 2; s \rangle$, there is no value of $p \in [i+1, i+w]$ which satisfies $A_p = B_{j+w}$, so $\mathbb{1}(A_p = B_{j+w}) = 0$. Putting all the four cases together, we have shown that the inner summation in Equation (5) is:
Deriving a closed form for Equation (6) is symmetric to Equation (5) with the exception that when $c = \langle *, 0; *, 0; s \rangle$, there is no value of $q$ in the range of the sum (i.e. $q \in [j+1, j+w-1]$) such that $A_{i+w} = B_q$. Hence, for the inner summation in Equation (6), we obtain
With a similar but more delicate case-by-case analysis, we also derive a closed form for Equation (4), whose proof we postpone until later. Then,
$$T = \sum_{s=0}^{w} \Bigl[ s\,t_1 N(\langle 0, 2; 0, 2; s \rangle) + 2s\,t_2 N(\langle 2, 2; 2, 2; s \rangle) + 2(s-2)\,t_2 N(\langle 2, 1; 2, 1; s \rangle) + (s-2)\,t_1 N(\langle 0, 1; 0, 1; s \rangle) + (s-1)\,t_1 \bigl( N(\langle 0, 1; 0, 2; s \rangle) + N(\langle 0, 2; 0, 1; s \rangle) + N(\langle 0, 0; 0, 0; s \rangle) \bigr) + 2(s-1)\,t_2 \bigl( N(\langle 2, 1; 2, 2; s \rangle) + N(\langle 2, 2; 2, 1; s \rangle) + N(\langle 2, 0; 2, 0; s \rangle) \bigr) \Bigr].$$
Finally, observe that summing Equation (7), Equation (8) and Equation (9) and then collecting the coefficients for each configuration, we obtain that $E[\hat{I}_{\mathrm{core}}] = C(A, B; w)$ as desired.
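The constants $t_0$, $t_1$ and $t_2$ used in the case analysis above can be read as integrals over the shared hash value $x$. The following sketch checks this with exact rational arithmetic. It assumes $x$ is uniform on $[0, 1]$ (consistent with the probability $(1-x)^{2w-1-s}$ derived in case (i)); interpreting the factor 2 that accompanies $t_2$ in $T$ as coming from two hash values being below $x$ is a reading made here, not a statement from the text. The helper names are hypothetical.

```python
from fractions import Fraction
from math import comb


def integral(a, b):
    """Exact integral of x**a * (1 - x)**b over [0, 1], via binomial expansion."""
    return sum(Fraction((-1) ** j * comb(b, j), a + j + 1) for j in range(b + 1))


def t(m, w, s):
    """t_m = 1 / ((2w-s)(2w-s+1)...(2w-s+m)), as recalled in the text."""
    denom = 1
    for r in range(m + 1):
        denom *= 2 * w - s + r
    return Fraction(1, denom)


w, s = 5, 3
b = 2 * w - 1 - s  # number of hash values that must exceed x

print(integral(0, b) == t(0, w, s))      # no hash value below x (case (i))
print(integral(1, b) == t(1, w, s))      # one hash value below x (case (ii))
print(integral(2, b) == 2 * t(2, w, s))  # two hash values below x
```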
We proceed with the proof of Fact 5.
Proof of Fact 5. For ease of notation, for a configuration $c$ and a pair $(i, j) \to c$, let Since $p \ne i+w$ and $q \ne j+w$, we have that $X^A_{i,p} = 1$ and $X^B_{j,q} = 1$ iff $a_i < x$, $b_j < x$, and $a_{i+1}, \ldots, a_{i+w}, b_{j+1}, \ldots, b_{j+w}$, with the exception of $a_p$ and $b_q$, are each greater than $x$. This corresponds to $2w-1-s$ hash values needing to be greater than $x$. What remains is to compute how many hash values need to be less than $x$.
We now restate Lemma 3, whose proof is a direct consequence of Lemma 1.
Proof. Recall that $M^A_p$ denotes the indicator random variable for $A_p$ being a minimizer in $A$. Then From Lemma 1, we know that $E[\hat{I}(A, B; w)] \ge C(A, B; w)$, and from Fact 4 we get that $\sum_{p=0}^{L-1} E[M^A_p] \le \frac{2L}{w+1}$. Combining these two facts, we deduce as desired. For the lower bound, from Fact 4 we can deduce that The lower bound then follows from Lemma 1.

A.2.2 Approximating the ratio of the minimizer union and intersection (Lemmas 4 and 5)
We begin this section with the proof of Lemma 4, where we obtain bounds for the variances of $\hat{I}(A, B; w)$ and $\hat{U}(A, B; w)$.
Proof. For ease of notation, we let $I = I(A, B)$ and $U = U(A, B)$. If $p$ is a position in $A$, then define $w_p = \{A_{\max\{0, p-w+1\}}, \ldots, A_{\min\{p+w-1, L-1\}}\}$ and, if $x = A_p$, we say that the k-mers in $w_p$ are nearby $x$ in $A$.
We begin with part (i). For ease of notation set $\hat{I} = \hat{I}(A, B; w)$ and recall that $\hat{I} = \sum_{p, q \,:\, A_p = B_q} M^A_p M^B_q$. Then, $M^A_p M^B_q$ and $M^A_{p'} M^B_{q'}$ are independent whenever $|p-p'| > 2(w-1)$, $|q-q'| > 2(w-1)$, $w_p \cap w_{q'} = \emptyset$, and $w_{p'} \cap w_q = \emptyset$, since these four conditions guarantee that the two windows of size $2w-1$ centered at $p$ and $q$ (which determine $M^A_p M^B_q$) do not share k-mers with the two windows of size $2w-1$ centered at $p'$ and $q'$ (which determine $M^A_{p'} M^B_{q'}$). Let $D$ be the set of tuples $(p, q, p', q')$ such that $p, q, p', q' \in [0, L)$, $A_p = B_q$, $A_{p'} = B_{q'}$ and at least one of the following conditions holds: (i) $|p-p'| \le 2(w-1)$; (ii) $|q-q'| \le 2(w-1)$; (iii) $w_p \cap w_{q'} \ne \emptyset$; (iv) $w_{p'} \cap w_q \ne \emptyset$. Then, $\mathrm{Var}(\hat{I}) = E[\hat{I}^2] - E[\hat{I}]^2 \le |D|$ and it thus suffices to derive an upper bound for $|D|$. To do so, we will count the number of tuples that satisfy each of the conditions in the definition of $D$ and add them together to get an upper bound on $|D|$.
For condition (i), there are $I$ values of $(p, q)$ such that $A_p = B_q$, and for each one, there are $4w-3$ possible values of $p'$ such that $|p-p'| \le 2(w-1)$. Then, for a given value of $p'$, there is at most one value of $q'$ that would satisfy $A_{p'} = B_{q'}$. Therefore there are at most $(4w-3)I$ values of $(p, q, p', q')$ that satisfy condition (i), i.e. $A_p = B_q$, $A_{p'} = B_{q'}$ and $|p-p'| \le 2(w-1)$. By the same logic, there are at most $(4w-3)I$ values of $(p, q, p', q')$ that satisfy condition (ii), i.e. $A_p = B_q$, $A_{p'} = B_{q'}$ and $|q-q'| \le 2(w-1)$.
For condition (iii), again there are $I$ values of $(p, q)$ such that $A_p = B_q$. Then, each k-mer $x \in w_p$ can occur at most once in $B$, hence there are at most $2w-1$ values of $q'$ such that $x \in w_{q'}$. Since $|w_p| \le 2w-1$, there are at most $(2w-1)^2$ values of $q'$ such that $w_p \cap w_{q'} \ne \emptyset$. For each value of $q'$, there is at most one value of $p'$ such that $B_{q'} = A_{p'}$. Therefore, there are at most $I(2w-1)^2$ values of $(p, q, p', q')$ that satisfy condition (iii), i.e. $A_p = B_q$, $A_{p'} = B_{q'}$ and $w_p \cap w_{q'} \ne \emptyset$. By symmetric logic, the number of tuples that satisfy condition (iv) is also at most $I(2w-1)^2$. Putting this all together, we get $\mathrm{Var}(\hat{I}) \le |D| \le 2(4w-3+(2w-1)^2)I \le 8w^2 I$, which completes the proof of part (i).
We prove part (ii) next. For a k-mer $x \in U$, let $U_x$ be the indicator random variable for the event that $x \in \hat{U}(A, B; w)$. Let $D$ be the set of all $(x, y)$ pairs such that $x \in U$, $y \in U$, and $U_x$ and $U_y$ are dependent. Then, $\hat{U} = \sum_{x \in U} U_x$ and $\mathrm{Var}(\hat{U}) = E[\hat{U}^2] - E[\hat{U}]^2 \le |D|$. It thus suffices to derive an upper bound for $|D|$. Let $x$ and $y$ belong to $U$. If $U_x$ and $U_y$ are dependent, then at least one of the following holds:
(i) One of the sequences (i.e. either $A$ or $B$) contains both $x$ and $y$ at a distance of at most $2(w-1)$.
(ii) A contains x, B contains y, and the nearby k-mers of x in A intersect with the nearby k-mers of y in B.
(iii) B contains x, A contains y, and the nearby k-mers of x in B intersect with the nearby k-mers of y in A.
We will count the possible number of $(x, y)$ pairs that satisfy each of the conditions and use their sum as an upper bound on $|D|$. For (i), there are 2 choices for which sequence contains $x$ and $y$, at most $L$ choices for the position of $x$, and at most $4w-3$ choices for the position of $y$. Hence, there are at most $2L(4w-3)$ choices for $x$ and $y$ that satisfy (i). For (ii), there are at most $L$ choices for the position of $x$. If $y$ satisfies the condition, then there must exist a k-mer $z$ which is nearby to $x$ in $A$ and also nearby to $y$ in $B$. There are at most $4w-3$ choices for $z$, and, for each of those choices, there are at most $4w-3$ locations for $y$. Hence, there are at most $L(4w-3)^2$ choices for $x$ and $y$ that satisfy (ii). Case (iii) is symmetrical to case (ii). In total then, $|D| \le 2L(4w-3)+2L(4w-3)^2 \le 32w^2 L$.
With these bounds for the variances of $\hat{I}(A, B; w)$ and $\hat{U}(A, B; w)$ we can now prove Lemma 5.
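For completeness, the two final inequalities in the variance bounds follow from expanding the polynomials in $w$. The short derivation below is only a worked check of that arithmetic and uses nothing beyond $w \ge 1$:

```latex
\begin{align*}
2\bigl(4w-3+(2w-1)^2\bigr) &= 2\,(4w^2-2) = 8w^2-4 \;\le\; 8w^2, \\
2L(4w-3)+2L(4w-3)^2 &= 2L(4w-3)(4w-2) = L\,(32w^2-40w+12) \;\le\; 32w^2L
  \quad \text{since } 40w \ge 12 \text{ for } w \ge 1 .
\end{align*}
```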
Lemma 5.ˇˇE Proof. We start by introducing some convenient notation. Let c " 6 ?
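We say that $\hat{I}$ and $\hat{U}$ are good if their values lie in the range $E[\hat{I}] \pm c\sigma_i$ and $E[\hat{U}] \pm c\sigma_u$, respectively; otherwise we say they are bad. Let $\hat{R} = \hat{I}/\hat{U}$. Note that $E[\hat{R}] = T_1 + T_2$, where $T_1 = E[\hat{R} \mid \hat{I} \text{ and } \hat{U} \text{ are good}] \cdot \Pr[\hat{I} \text{ and } \hat{U} \text{ are good}]$ and $T_2 = E[\hat{R} \mid \hat{I} \text{ or } \hat{U} \text{ are bad}] \cdot \Pr[\hat{I} \text{ or } \hat{U} \text{ are bad}]$.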
We will bound $T_1$ and $T_2$ separately. Observe that by Chebyshev's inequality Mitzenmacher and Upfal [2017], the probability that $\hat{I}$ is bad is at most $c^{-2}$ and the same holds for $\hat{U}$. Hence, a union bound implies that $\Pr[\hat{I} \text{ or } \hat{U} \text{ are bad}] \le 2c^{-2}$. Since $\hat{I} \le \hat{U}$, $\hat{R} \le 1$, and we obtain the following bounds for $T_2$: Observe that for all $x > 0$ and $y > 0$, $\sqrt{x} + \sqrt{y} \le \sqrt{2(x+y)}$. Then, using Lemma 4, we get: Furthermore, since every $w$ consecutive k-mers have at least one minimizer, $\hat{U} \ge L/w$, and so Plugging this bound into Equation (11) where the last inequality follows from Equation (12).
Now consider the case when $E[\hat{U}] - E[\hat{I}] \ge c(\sigma_i + \sigma_u)$. Using the fact that $\frac{a}{b} \le \frac{a+x}{b+x}$, for $0 < a \le b$ and $x \ge 0$, we obtain where the last inequality follows from Equation (12).
Putting the upper bounds on $T_1$ and $T_2$ together we get Combined with Equation (13)
Proof. Observe that for $i \in [w, L-1]$ and $j \in [w, L-1]$, we have $A_i = B_j$ if and only if $(i-w, j-w)$ are in a configuration with $C_{a,\mathrm{right}} = C_{b,\mathrm{right}} = 0$. In the case that $A$ and $B$ are padded, then $I = D$ and $J = \frac{I}{2L-I} = \frac{D}{2L-D}$. In general, the number of $(i, j)$ pairs for which $A_i = B_j$ and either $i \in [0, w-1]$ or $j \in [0, w-1]$ is at most $2w$. Hence, $D \le I \le D+2w$. For the $J$ lower bound, When $D+2w > L$, then We note that it is possible to derive exact expressions for $I(A, B; w)$ and $J(A, B; w)$ for the non-padded case as well; however, doing so is not necessary for our purposes and would just introduce (even more) burdensome notation.
Next, we need to prove two facts:
Fact 2. $C(A, B; w) \le \frac{2L}{w+1}$.
Proof. By Lemma 1, the definition of $\hat{I}$, and Fact 4, we have $C(A, B; w) \le E[\hat{I}(A, B; w)] \le \sum_{p=0}^{L-1} E[M^A_p] \le \frac{2L}{w+1}$.
Proof. Note that under the given assumptions, $y-x \ge y/2 > 0$ and $y-x-10 \ge y/2-10 > 0$. Therefore,
Now, we are ready to prove Theorem 1.
Theorem 1. Let $w \ge 2$, $k \ge 2$, and $L \ge 7(w+1)$ be integers. Let $A$ and $B$ be two duplicate-free sequences, each consisting of $L$ k-mers.
as claimed.
Proof. We omit the parameters $A$, $B$ and $w$ from the following for conciseness. Let $d = \frac{2}{w+1}$. Observe that the following statements are equivalent: Note that for the second equivalence, we rely on the fact that $B(A, B; w)$ is well defined and its denominators are not zero. In other words, 1) $2L-D > 0$ because $D \le L$ (by definition) and 2) $2dL-C > 0$ because $C \le dL$ (by Fact 2).
We now need to show that $C \le dD$. We have $C \le E[\hat{I}]$ (by Lemma 1) $\le Id = dD$ (by Lemma 6). Note that Equation (14) follows because of the fact that $A$ and $B$ are padded and Fact 4. Next, observe that since all the terms in Equation (14) are positive, the only way to have equality with $Id$ is if each term $\Pr[M^A_p = 1 \mid M^B_q = 1]$ is 1. We claim this can only happen if there are no shared k-mers between $A$ and $B$, i.e. when $J(A, B) = 0$. Otherwise, take the leftmost shared k-mer in $A$. The window to its left in $A$ will be assigned hash values that are independent of the hash values in $B$; therefore, $\Pr[M^A_p = 1 \mid M^B_q = 1]$ cannot be 1. Thus, if $A$ and $B$ share at least one k-mer, we get the stronger statement that $E[\hat{I}(A, B; w)] < Id$. This in turn implies that $C < dD$, which propagates to imply that $B < 0$.

A.5 Proof of Theorem 4
Theorem 4. Let $2 \le w < k$, $g > w+2k$, and $L = \ell g + k$ for some integer $\ell \ge 1$. Let $A$ and $B$ be two duplicate-free sequences with $L$ k-mers such that $A$ and $B$ are identical except that the nucleotides at positions $k-1+ig$, for $i = 0, \ldots, \ell$, are mutated. Then, where $h(w) = \frac{(w+1)(1-2(H_{2w}-H_w))}{2}$ and $H_n = \sum_{j=1}^{n} \frac{1}{j}$ denotes the $n$-th Harmonic number.
From this fact, which we prove later, we get that $C(A, B; w) = d\ell(g-k) + \ell f(w)$, where $d = 2/(w+1)$ and
$$f(w) = -\frac{2w}{w+1} + \frac{w+5}{(w+1)(w+2)} + \sum_{s=1}^{w-1} \Bigl[ s\,t_1 + t_2 \bigl( 6w + 8w^2 - s(s+6w+1) \bigr) \Bigr].$$
Table S3. Non-empty configurations appearing in the definition of $C$, along with their counts in the context of Theorem 4 as well as why the counts are zero, if applicable. The reasons are explained in the proof of Fact 8.
Note that since there are no matches in the first or the last k-mers and $k \ge w$, we have by Lemma 6 that $I = |A \cap B| = D(A, B; w) = \ell(g-k)$ and so $C(A, B; w) = dI + \ell f(w)$. From the definition of $B(A, B; w)$, we then have $B(A, B; w) = \frac{C}{2dL-C} - \frac{I}{2L-I}$.
We also have the following closed form for f pwq (which we prove later).
It remains for us to provide the proofs of Facts 6 and 7. Fact 6 is a direct consequence of the following configuration counts.
Proof. We will refer to $\langle 2, 2; 2, 2; 0 \rangle$ as the empty configuration. Table S3 lists all non-empty configurations that appear in the definition of $C$. Sometimes, a configuration type is further sub-divided according to different values of $s$. We will show that the counts in the table are correct, which will prove the Theorem.
The rows whose reason is VERT have configurations that match $\langle *, *; 1, 0; s \rangle$, $\langle *, *; 0, 1; s \rangle$, $\langle 1, 0; *, *; s \rangle$, or $\langle 0, 1; *, *; s \rangle$. These configurations never occur because in our setting, all the matches are parallel to each other (i.e. if $A_i = B_j$ and $A_{i'} = B_{j'}$, then $j-i = j'-i'$), while these configurations contain a 0 in one place (indicating that the matches are vertical, i.e. $A_i = B_j$ implies $i = j$) and a 1 in another (indicating that the matching edges are angled, i.e. $A_i = B_j$ implies $i \ne j$). The rows whose reason is CROSS have a configuration that matches $\langle 1, *; 1, *; s \rangle$, $\langle *, 1; *, 1; s \rangle$, $\langle 1, 1; *, *; s \rangle$, or $\langle *, *; 1, 1; s \rangle$. These configurations never occur because the 1s indicate conflicting angles for the matches: they should either slant left (e.g. $i > j$) or right (e.g. $i < j$), but cannot do both. Note that for rows that could be categorized as both VERT and CROSS, the reason in the Table is arbitrarily chosen from those two. The rows whose reason is TOO-FULL have a configuration that matches $\langle *, 2; *, *; w \rangle$ or $\langle *, *; *, 2; w \rangle$. These configurations can never occur because the presence of the 2 indicates that either $A_{i+w}$ or $B_{j+w}$ is not involved in a match, making it impossible that $S(i+1, j+1, w) = w$. The rows whose reason is TOO-EMPTY have a configuration that matches $\langle *, *; *, \{0, 1\}; 0 \rangle$ or $\langle *, \{0, 1\}; *, *; 0 \rangle$. These configurations can never occur because the presence of the 0 or 1 indicates that either $A_{i+w}$ or $B_{j+w}$ is involved in a match, making it impossible that $S(i+1, j+1, w) = 0$.
By the definition of $A$ and $B$ from Theorem 4, we have alternating runs of $k$ mismatches followed by $g-k$ matches, with $k$ mismatches at the end. Therefore, we have $\ell+1$ blocks of $k$ mismatches, at $i \in \{ig, \ldots, ig+k-1 \mid 0 \le i \le \ell\}$, and we have $\ell$ blocks of $g-k$ matches, at $i \in \{ig+k, \ldots, (i+1)g-1 \mid 0 \le i < \ell\}$. We will refer to the latter as match-blocks.
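The block structure described above can be reproduced on a toy instance. The sketch below builds a pair of sequences with the mutation pattern of Theorem 4 and compares k-mers position by position. It is only an illustration: the random string `a` is not checked for duplicate k-mers (which the theorem requires), the deterministic substitution in `mutate` and the parameter values are arbitrary choices made here, and the function names are hypothetical.

```python
import random


def mutate(base):
    """Substitute a nucleotide with a different one (arbitrary deterministic choice)."""
    return "ACGT"[("ACGT".index(base) + 1) % 4]


def theorem4_pair(w, k, g, ell, rng):
    """Build A and B with L = ell*g + k k-mers, mutated at positions k-1+i*g."""
    assert 2 <= w < k and g > w + 2 * k and ell >= 1
    num_kmers = ell * g + k
    a = [rng.choice("ACGT") for _ in range(num_kmers + k - 1)]
    b = list(a)
    for i in range(ell + 1):
        pos = k - 1 + i * g
        b[pos] = mutate(a[pos])
    return "".join(a), "".join(b)


w, k, g, ell = 5, 8, 30, 4
a, b = theorem4_pair(w, k, g, ell, random.Random(0))
L = ell * g + k
# matches[m] is True when the k-mers starting at position m agree in A and B.
matches = [a[m:m + k] == b[m:m + k] for m in range(L)]
print(sum(matches), "matching positions; expected ell*(g-k) =", ell * (g - k))
```

The first match-block occupies positions $k, \ldots, g-1$, which is the range used in the counting arguments that follow.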
Recall that configuration windows are of length $w+1$. Because $k > w$, no window can contain matches from more than one match-block. Moreover, any configurations involving an $i$ or $j$ in the first match-block will occur again in each other match-block, at the same coordinates modulo $g$. Thus it is enough to consider only the first match-block, and multiply the resulting counts by $\ell$. We therefore restrict ourselves to the first match-block in the following discussion, and note that the leftmost match is at position $k$ and the rightmost match is at $g-1$.
Let us consider the configurations that are $\langle 2, 2; 2, 2; {>}0 \rangle$. In this case, $A_i \ne B_j$ and $A_{i+w} \ne B_{j+w}$, and there is some $i' \in [i+1, i+w-1]$ and $j' \in [j+1, j+w-1]$ such that $A_{i'} = B_{j'}$. This match must be part of a match block, and in our setting, a match block has width $g-k$. This is more than $w$, making it impossible that $A_i \ne B_j$ and $A_{i+w} \ne B_{j+w}$. Hence $N(\langle 2, 2; 2, 2; {>}0 \rangle) = 0$.
Let us consider the configurations that are $\langle 0, 0; 0, 0; s \rangle$. In these configurations, $i = j$, $A_i = B_j$, and $A_{i+w} = B_{j+w}$. A configuration window of width $w+1$ cannot span more than one match block, since $g > w$. Therefore, $A_{i+\delta} = B_{j+\delta}$ for all $0 \le \delta \le w$. Hence, the number of configurations with $s < w$ is 0. For $s = w$, Figure S2A shows all the configurations that are $\langle 0, 0; 0, 0; w \rangle$. We have that $i \in [k, g-w-1]$, resulting in $g-w-k$ possible windows with this configuration, in one match block.
Let us consider the configurations that are $\langle 0, 2; 0, 2; s \rangle$ for $0 \le s \le w-1$. In this situation, $A_i = B_j$ and hence $i = j$. The match block containing this match ends before $A_{i+w}$, since $A_{i+w} \ne B_{j+w}$ in this configuration. Then the rightmost match, $A_{g-1} = B_{g-1}$, must be somewhere in the window, other than at $i+w$. To get $s$ matches, $g-1 = i+s$ and thus $i = g-s-1$. Therefore, $N(\langle 0, 2; 0, 2; s \rangle) = 1$ for each $s \in [0, w-1]$. Figure S2B shows what this configuration looks like. The top and bottom drawings show the two end cases, while the middle drawing demonstrates the general case.
Let us consider the configurations that are $\langle 2, 0; 2, 0; s \rangle$ for $1 \le s \le w$. The case is mostly symmetric to the previous one. In this situation, $A_{i+w} = B_{j+w}$ and hence $i = j$. The match block containing this match begins after $A_i$, since $A_i \ne B_j$ in this configuration. The leftmost match in the match-block, $A_k$, must be somewhere in the window other than at $A_i$. To get $s$ matches, $k = (i+w)-(s-1)$ and thus $i = k-w+s-1$. Therefore $N(\langle 2, 0; 2, 0; s \rangle) = 1$ for each $s \in [1, w]$. Figure S2C shows what this configuration looks like. The top and bottom drawings show the two end cases, while the middle drawing demonstrates the general case.
Let us consider the configurations that are $\langle 2, 1; 2, 2; s \rangle$ for $1 \le s \le w-1$. Figure S2D shows all the configurations. There are several possibilities for each $s$. For $s = 3$, the top and bottom drawings show the two end cases, while the middle drawing demonstrates the general case. Because $C_{a,\mathrm{right}} = 1$, $A_{i+w} \in \{B_{j+1}, \ldots, B_{j+w-1}\}$ and $j > i$. Since $C_{a,\mathrm{left}} = C_{b,\mathrm{left}} = 2$, $A_i \ne B_j$, and the leftmost match in the match-block, $A_k$, must be somewhere in the window, other than at $i$. To get $s$ matches, $k = (i+w)-(s-1)$ and thus $i = k-w+s-1$. The window for $B$ can be positioned so that the leftmost match occurs in $\{j+1, \ldots, j+w-s\}$. Since this corresponds to $A_k$, we have $k \in \{j+1, \ldots, j+w-s\}$, which can be restated as $(i+w)-(s-1) \in \{j+1, \ldots, j+w-s\}$. We can in turn restate this as $i \in \{j-w+s, \ldots, j-1\}$ and thus $j \in \{i+1, \ldots, i+w-s\}$. Therefore, $N(\langle 2, 1; 2, 2; s \rangle) = w-s$ for each $s \in [1, w-1]$.
Finally, we consider the configurations that are $\langle 2, 2; 1, 2; s \rangle$ for $1 \le s \le w-1$. This case is symmetrical to the above case, by swapping the roles of $A$ and $B$ in the definition of the configurations. Therefore, $N(\langle 2, 2; 1, 2; s \rangle) = w-s$ for each $1 \le s \le w-1$.
We are now ready to prove Fact 6. Fact 6.
Proof. Let us consider first the $s = 0$ case. By Fact 8, the only two configurations with $s = 0$ and with non-zero counts are $\langle 2, 2; 2, 2; 0 \rangle$ and $\langle 0, 2; 0, 2; 0 \rangle$. However, both of those terms are multiplied by $s$ in $W(0)$, hence we have $W(0) = 0$.
Let us consider next the $s = w$ case. For this value of $s$, by Fact 8, we have $N(\langle 0, 0; 0, 0; w \rangle) = \ell(g-w-k)$ and $N(\langle 2, 0; 2, 0; w \rangle) = \ell$; all other configurations that may contribute to $C(A, B; w)$ have zero counts.
We conclude this section with the proof of Fact 7.
We will now reduce each of the sums.
By using partial fraction decomposition, we can algebraically simplify each of the terms as follows: By plugging these expressions back into $T$, we get Now, we plug the value of $T$ into $f(w)$ and it finishes the proof,
$$f(w) = -\frac{2w}{w+1} + \frac{w+5}{(w+1)(w+2)} + \frac{3(w^2+2w-1)}{(w+1)(w+2)} - 2(H_{2w}-H_w) = 1 - 2(H_{2w}-H_w).$$
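The closed form $f(w) = 1 - 2(H_{2w} - H_w)$ can be checked against the sum form of $f(w)$ given after Theorem 4, and then plugged into $B(A, B; w) = \frac{C}{2dL-C} - \frac{I}{2L-I}$ with $C = dI + \ell f(w)$ and $I = \ell(g-k)$. The sketch below does both with exact rational arithmetic. The function names and the example parameters are illustrative choices made here, and the value returned by `bias` is the expression $B$ built from these pieces, not a simulated bias.

```python
from fractions import Fraction


def harmonic(n):
    """n-th Harmonic number H_n as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, n + 1))


def f_sum(w):
    """f(w) in its sum form, with t_1 and t_2 evaluated at each s."""
    total = Fraction(-2 * w, w + 1) + Fraction(w + 5, (w + 1) * (w + 2))
    for s in range(1, w):
        n = 2 * w - s
        t1 = Fraction(1, n * (n + 1))
        t2 = Fraction(1, n * (n + 1) * (n + 2))
        total += s * t1 + t2 * (6 * w + 8 * w * w - s * (s + 6 * w + 1))
    return total


def f_closed(w):
    """Closed form of f(w)."""
    return 1 - 2 * (harmonic(2 * w) - harmonic(w))


def bias(w, k, g, ell):
    """B(A, B; w) for the sequence family of Theorem 4."""
    L = ell * g + k
    I = ell * (g - k)
    d = Fraction(2, w + 1)
    C = d * I + ell * f_closed(w)
    return C / (2 * d * L - C) - Fraction(I, 2 * L - I)


print(all(f_sum(w) == f_closed(w) for w in range(2, 25)))
print(float(bias(w=5, k=8, g=30, ell=4)))
```

Since $H_{2w} - H_w > 1/2$ for every $w \ge 2$, $f(w)$ is negative, so $C < dI$ and the resulting $B$ is negative for this family.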

Fig. S3: Empirical bias for related sequence pairs, with and without duplicates. We set $k = 16$, $w = 200$, $L = 10000$, and $r_1 \in \{0.001, 0.005, 0.01, 0.05, 0.1\}$, with one mutation replicate. The duplicate-free sequence is the same as in Figure 4. The sequence with duplicates was found by choosing 100 random L-k-mer sequences from E.coli and choosing from those the one with the most duplicate k-mers (it had 1,377 duplicates, or about 14%). The colored bands show the 2.5th and the 97.5th percentiles. The evenly dashed line shows the expected behavior of an unbiased estimator, i.e. an estimate equal to the true Jaccard $J$.

A.6 Experimental details
In this section, we provide some experimental details to aid reproducibility. The scripts to reproduce our experiments are available on our GitHub paper repository.
Generative models: When we generate an unrelated pair, we greedily extend each string from left to right. At each position, we choose, uniformly at random, one of the nucleotides that would not result in a k-mer we have already seen. If we get to a point where all the possible nucleotide extensions to a string are already present, we discard the string and start from the beginning. Though this sampling scheme is not guaranteed to terminate, we found that it always did in our experiments. We also verified that the Jaccard of the generated pair was close to the $j$ that was used as a target. Under the assumptions that $A$ and $B$ are uniformly chosen, $j$ should be the expected value under the generative process. Though it is not clear that the uniformity assumption holds in our generative process, we found that the true Jaccard was indeed very close to $j$ in practice. In the related pair model, we also faced a possibility that after choosing to mutate a position, all the possible nucleotide substitutions would create a duplicate k-mer. In such a case, the position was left unchanged.
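The greedy extension with restarts described above can be sketched as follows for a single duplicate-free string. This is not the authors' script (see the repository for that): the choice of a uniformly random first $k-1$ nucleotides, the restart cap `max_restarts`, and the function name are assumptions made here, and the construction of a pair with a target Jaccard is not covered.

```python
import random


def duplicate_free_string(num_kmers, k, rng, max_restarts=1000):
    """Greedily build a string whose num_kmers k-mers are all distinct."""
    length = num_kmers + k - 1
    for _ in range(max_restarts):
        # Assumption: the first k-1 nucleotides are drawn uniformly at random.
        seq = [rng.choice("ACGT") for _ in range(k - 1)]
        seen = set()
        while len(seq) < length:
            prefix = "".join(seq[len(seq) - (k - 1):])
            # Keep only extensions that do not create an already-seen k-mer.
            options = [c for c in "ACGT" if prefix + c not in seen]
            if not options:
                break  # dead end: discard the string and start over
            choice = rng.choice(options)
            seen.add(prefix + choice)
            seq.append(choice)
        else:
            return "".join(seq)
    raise RuntimeError("no duplicate-free string found within max_restarts")


s = duplicate_free_string(num_kmers=500, k=8, rng=random.Random(1))
kmers = [s[i:i + 8] for i in range(len(s) - 8 + 1)]
print(len(kmers), len(set(kmers)))
```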
Mashmap divergence experiment: We sampled 100 substrings from the E.coli reference, each of length $L = 10{,}000$ and, for each substring and for each $r_1 \in \{0.90, 0.95, 0.99\}$, generated a "read" which was the substring with $r_1 L$ positions randomly picked and mutated. We then mapped it with mashmap, and discarded any read for which mashmap did not identify a unique and correct mapping location. Mashmap was run with default parameters of $k = 16$ and $w = 200$.
Correction formula to remove Poisson-approximation from Mash distance: Let $j$ be the observed Jaccard. Let $A$ and $B$ be two sequences generated using a simple mutation process, i.e. a substitution is created at every nucleotide with a given probability $r_1$ Blanca et al. [2021]. The method of moments Wasserman [2013] estimator for the sequence identity is $\hat{i}_{\mathrm{mom}} = (1 - n/L)^{1/k}$, where $n$ is the observed number of mutated k-mers Blanca et al. [2021]. In the simple mutation model, the observed Jaccard $j$ is related to $n$ via $j = \frac{L-n}{L+n}$, or, equivalently, $n = \frac{L(1-j)}{1+j}$ Blanca et al. [2021]. Putting this together, we get that $\hat{i}_{\mathrm{mom}} = \left(1 - \frac{1-j}{1+j}\right)^{1/k} = \left(\frac{2j}{1+j}\right)^{1/k}$. On the other hand, the Mash distance estimator is $-\frac{1}{k} \log\left(\frac{2j}{1+j}\right)$ (Formula 1 in Jain et al. [2017]), which equivalently translates to the identity estimator $\hat{i}_{\mathrm{mash}} = 1 + \frac{1}{k} \log\left(\frac{2j}{1+j}\right)$. Combining the two, we get that $\hat{i}_{\mathrm{mash}} = 1 + \frac{1}{k} \log\left((\hat{i}_{\mathrm{mom}})^k\right)$. Solving for $\hat{i}_{\mathrm{mom}}$, we get the final correction formula: $\hat{i}_{\mathrm{mom}} = e^{\hat{i}_{\mathrm{mash}} - 1}$.
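The correction can be checked numerically by computing both identity estimators from the same Jaccard value and applying the final formula. The following is a small sketch with hypothetical function names; it assumes the natural logarithm, which is what makes the exponential in the correction formula the exact inverse.

```python
import math


def identity_mom(j, k):
    """Method-of-moments identity estimate from the observed Jaccard j."""
    return (2 * j / (1 + j)) ** (1 / k)


def identity_mash(j, k):
    """Identity estimate implied by the Mash distance (natural log assumed)."""
    return 1 + math.log(2 * j / (1 + j)) / k


def correct_mash_identity(i_mash):
    """Correction formula: recover the method-of-moments estimate."""
    return math.exp(i_mash - 1)


k = 16
for j in (0.2, 0.5, 0.9):
    mash = identity_mash(j, k)
    print(j, round(mash, 6), round(correct_mash_identity(mash), 6),
          round(identity_mom(j, k), 6))
```

The last two columns agree up to floating-point error, and the uncorrected Mash identity is never larger than the corrected one because $1 + \log y \le y$.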
Sliding read experiment: When choosing A, we avoided segments with any Ns or any duplicate k-mers. Any k-mers in B containing an N were hashed to the maximum hash value so as to avoid them being a minimizer. Also note that minimizers were computed separately for each B; thus, it is possible that the same k-mer might be a minimizer in one B but not a minimizer in a nearby B.
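The handling of Ns described above can be illustrated with a toy minimizer sketch and the Jaccard estimate computed from two sketches. This is a minimal sketch, not mashmap's implementation: the BLAKE2b-based hash, the sentinel `MAX_HASH`, the convention for an empty union, and all function names are assumptions made here for illustration.

```python
import hashlib

MAX_HASH = 2 ** 64  # strictly larger than any 8-byte hash value


def kmer_hash(kmer):
    """Hash a k-mer; k-mers containing N get the maximum value."""
    if "N" in kmer:
        return MAX_HASH
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer_sketch(seq, k, w):
    """Set of k-mers that have the smallest hash in some window of w k-mers."""
    ks = kmers(seq, k)
    hs = [kmer_hash(x) for x in ks]
    sketch = set()
    for start in range(len(ks) - w + 1):
        best = min(range(start, start + w), key=hs.__getitem__)
        if hs[best] < MAX_HASH:  # skip windows made up only of N k-mers
            sketch.add(ks[best])
    return sketch


def jaccard(x, y):
    """Jaccard similarity of two sets (0.0 for an empty union, by convention here)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


a = "ACGTTGCAAGGCTTAACCGGTTAANACGATCGATTGACC"
b = a.replace("GGCTT", "GGATT")  # a single substitution
estimate = jaccard(minimizer_sketch(a, 4, 3), minimizer_sketch(b, 4, 3))
print(estimate)
```

Because the sketch is recomputed for each input string, a k-mer can be a minimizer in one string and not in an overlapping one, which mirrors the remark about nearby $B$s.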
Empirical bias for related sequence pairs, allowing duplicates: The sequence chosen as the basis for the related experiment in Figure 4 did not contain duplicates, by chance. We wanted to check the extent to which this experiment would have been affected by duplicates. We chose 100 random sequences from E.coli and, from those, chose the one with the most duplicate k-mers. It had 1,377 duplicates, or about 14%. Figure S3 compares the bias for this sequence to the duplicate-free one in Figure 4. There is almost no visually discernible difference between the two.